Identifying Urdu Complex Predication via Bigram Extraction
Abstract
A problem that crops up repeatedly in shallow and deep syntactic parsing approaches to South Asian languages like Urdu/Hindi is the proper treatment of complex predications. The productivity of complex predications and the ill-understood range of their combinatorial possibilities pose problems for NLP. This paper investigates whether fine-grained information about the distributional properties of nouns in N+V complex predications (CPs) can be identified by the comparatively simple process of extracting bigrams from a large “raw” corpus of Urdu. In gathering the relevant properties, we were aided by visual analytics: we coupled our computational data analysis with interactive visual components for analyzing the large data sets. The visualization component proved to be an essential part of our data analysis, particularly for the easy visual identification of outliers and false positives. Another essential component turned out to be our language-particular knowledge and access to existing language-particular resources. Overall, we were indeed able to identify high-frequency N+V complex predications as well as pick out combinations we had not been aware of before. However, a manual inspection of our results also pointed to a problem of data sparsity, despite the use of a large corpus.
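The bigram extraction step mentioned in the abstract can be illustrated with a minimal sketch. It assumes whitespace-tokenized, romanized text and a small hypothetical set of light-verb stems (`LIGHT_VERBS`); the function name, the verb list, and the toy data are illustrative assumptions, not the resources or procedure used in the paper.

```python
from collections import Counter

# Hypothetical, illustrative set of romanized light-verb stems.
# This is an assumption for the sketch, not the paper's actual list.
LIGHT_VERBS = {"kar", "ho"}


def count_noun_verb_bigrams(sentences, light_verbs=LIGHT_VERBS):
    """Count adjacent (word, light verb) bigrams in whitespace-tokenized text."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Pair each token with its right-hand neighbour.
        for left, right in zip(tokens, tokens[1:]):
            if right in light_verbs:
                counts[(left, right)] += 1
    return counts


# Toy token sequences, invented purely for illustration.
sample = ["yaad kar", "yaad kar", "yaad ho", "kaam kar"]
bigram_counts = count_noun_verb_bigrams(sample)
```

Raw adjacency counts like these say nothing about whether the left-hand word is actually a noun, which is one reason false positives have to be filtered out afterwards, as the abstract notes.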
Similar resources
N-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language
Extraction of named entities (NEs) from text is an important operation in many natural language processing applications, such as information extraction, question answering, and machine translation. Since the early 1990s, researchers have taken greater interest in this field, and a lot of work has been done on Named Entity Recognition (NER) in different languages of the world. Unfortunately...
A Hybrid Approach for NER System for Scarce Resourced Language-URDU: Integrating n-gram with Rules and Gazetteers
We present a hybrid NER (Named Entity Recognition) system for Urdu script that integrates an n-gram model (unigram and bigram), rules, and gazetteers. We used prefix and suffix characters for rule construction instead of first name and last name lists or potential terms on the output list that is produced by the n-gram model. Evaluation of the system is performed on two corpora, the IJCNLP NE (Named E...
Named Entity Recognition System for Postpositional Languages: Urdu as a Case Study
Named Entity Recognition and Classification is the process of identifying named entities and classifying them into classes such as person name, organization name, and location name. In this paper, we propose a tagging scheme, Begin Inside Last -2 (BIL2), for Subject Object Verb (SOV) languages that contain postpositions. We use the Urdu language as a case study. We compare the F-measu...
Protein Structural Class Prediction via k-Separated Bigrams Using Position Specific Scoring Matrix
Protein structural class prediction (SCP) is an important task in identifying protein tertiary structure and protein functions. In this study, we propose a feature extraction technique to predict secondary structures. The technique utilizes bigram (of adjacent and k-separated amino acids) information derived from the Position Specific Scoring Matrix (PSSM). The technique has shown promising results...
Encoding event structure in Urdu/Hindi VerbNet
We propose a new kind of event structure representation for computational linguistics, based on the theoretical framework of First Phase Syntax (Ramchand, 2008). We show that the approach not only gives a theoretically well-motivated set of subevents and related semantic roles, but also posits the levels of representation needed for analyzing a linguistic phenomenon that has repeatedly caused pro...